composite task
- North America > United States > Utah > Utah County > Provo (0.04)
- North America > United States > Massachusetts (0.04)
Evaluating Multimodal Large Language Models with Daily Composite Tasks in Home Environments
Zhang, Zhenliang, Wang, Yuxi, Xie, Hongzhao, Zhao, Shiyun, Liu, Mingyuan, Lu, Yujie, He, Xinyi, Cheng, Zhenku, Peng, Yujia
A key feature differentiating artificial general intelligence (AGI) from traditional AI is that AGI can perform composite tasks that require a wide range of capabilities. Although embodied agents powered by multimodal large language models (MLLMs) offer rich perceptual and interactive capabilities, it remains largely unexplored whether they can solve composite tasks. In the current work, we designed a set of composite tasks inspired by common daily activities observed in early childhood development. Within a dynamic and simulated home environment, these tasks span three core domains: object understanding, spatial intelligence, and social activity. We evaluated 17 leading proprietary and open-source MLLMs on these tasks. The results consistently showed poor performance across all three domains, indicating a substantial gap between current capabilities and general intelligence requirements. Together, our tasks offer a preliminary framework for evaluating the general capabilities of embodied agents, marking an early but significant step toward the development of embodied MLLMs and their real-world deployment.
- Health & Medicine > Therapeutic Area (0.68)
- Education (0.46)
Internal Chain-of-Thought: Empirical Evidence for Layer-wise Subtask Scheduling in LLMs
Yang, Zhipeng, Li, Junzhuo, Xia, Siyu, Hu, Xuming
We show that large language models (LLMs) exhibit an $\textit{internal chain-of-thought}$: they sequentially decompose and execute composite tasks layer-by-layer. Two claims ground our study: (i) distinct subtasks are learned at different network depths, and (ii) these subtasks are executed sequentially across layers. On a benchmark of 15 two-step composite tasks, we employ layer-from context-masking and propose a novel cross-task patching method, confirming (i). To examine claim (ii), we apply LogitLens to decode hidden states, revealing a consistent layerwise execution pattern. We further replicate our analysis on the real-world $\text{TRACE}$ benchmark, observing the same stepwise dynamics. Together, our results enhance LLMs transparency by showing their capacity to internally plan and execute subtasks (or instructions), opening avenues for fine-grained, instruction-level activation steering.
COGITAO: A Visual Reasoning Framework To Study Compositionality & Generalization
Taoudi-Benchekroun, Yassine, Troyan, Klim, Sager, Pascal, Gerber, Stefan, Tuggener, Lukas, Grewe, Benjamin
The ability to compose learned concepts and apply them in novel settings is key to human intelligence, but remains a persistent limitation in state-of-the-art machine learning models. To address this issue, we introduce COGITAO, a modular and extensible data generation framework and benchmark designed to systematically study compositionality and generalization in visual domains. Drawing inspiration from ARC-AGI's problem-setting, COGITAO constructs rule-based tasks which apply a set of transformations to objects in grid-like environments. It supports composition, at adjustable depth, over a set of 28 interoperable transformations, along with extensive control over grid parametrization and object properties. This flexibility enables the creation of millions of unique task rules -- surpassing concurrent datasets by several orders of magnitude -- across a wide range of difficulties, while allowing virtually unlimited sample generation per rule. We provide baseline experiments using state-of-the-art vision models, highlighting their consistent failures to generalize to novel combinations of familiar elements, despite strong in-domain performance. COGITAO is fully open-sourced, including all code and datasets, to support continued research in this field.
- Europe > Switzerland > Zürich > Zürich (0.04)
- Oceania > Australia > Victoria > Melbourne (0.04)
- North America > United States > Massachusetts > Suffolk County > Boston (0.04)
- (5 more...)
- Information Technology > Artificial Intelligence > Vision (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning (1.00)
- Information Technology > Artificial Intelligence > Natural Language (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- North America > United States > Utah > Utah County > Provo (0.04)
- North America > United States > Massachusetts (0.04)
Do Large Language Models Have Compositional Ability? An Investigation into Limitations and Scalability
Xu, Zhuoyan, Shi, Zhenmei, Liang, Yingyu
Large language models (LLMs) have emerged as powerful tools for many AI problems and exhibit remarkable in-context learning (ICL) capabilities. Compositional ability, solving unseen complex tasks that combine two or more simple tasks, is an essential reasoning ability for Artificial General Intelligence. Despite the tremendous success of LLMs, how they approach composite tasks, especially those not encountered during the pretraining phase, remains an open and largely underexplored question. In this study, we delve into the ICL capabilities of LLMs on composite tasks, with only simple tasks as in-context examples. We develop a test suite of composite tasks including linguistic and logical challenges and perform empirical studies across different LLM families. We observe that models exhibit divergent behaviors: (1) For simpler composite tasks that apply distinct mapping mechanisms to different input segments, the models demonstrate decent compositional ability, while scaling up the model enhances this ability; (2) for more complex composite tasks involving reasoning multiple steps, where each step represents one task, models typically underperform, and scaling up generally provides no improvements. We offer theoretical analysis in a simplified setting, explaining that models exhibit compositional capability when the task handles different input parts separately. We believe our work sheds new light on the capabilities of LLMs in solving composite tasks regarding the nature of the tasks and model scale. Our dataset and code are available at {\url{https://github.com/OliverXUZY/LLM_Compose}}.
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)
- South America > Colombia > Meta Department > Villavicencio (0.04)
- Asia > China > Guangxi Province > Nanning (0.04)
- (7 more...)
RoboCasa: Large-Scale Simulation of Everyday Tasks for Generalist Robots
Nasiriany, Soroush, Maddukuri, Abhiram, Zhang, Lance, Parikh, Adeet, Lo, Aaron, Joshi, Abhishek, Mandlekar, Ajay, Zhu, Yuke
Recent advancements in Artificial Intelligence (AI) have largely been propelled by scaling. In Robotics, scaling is hindered by the lack of access to massive robot datasets. We advocate using realistic physical simulation as a means to scale environments, tasks, and datasets for robot learning methods. We present RoboCasa, a large-scale simulation framework for training generalist robots in everyday environments. RoboCasa features realistic and diverse scenes focusing on kitchen environments. We provide thousands of 3D assets across over 150 object categories and dozens of interactable furniture and appliances. We enrich the realism and diversity of our simulation with generative AI tools, such as object assets from text-to-3D models and environment textures from text-to-image models. We design a set of 100 tasks for systematic evaluation, including composite tasks generated by the guidance of large language models. To facilitate learning, we provide high-quality human demonstrations and integrate automated trajectory generation methods to substantially enlarge our datasets with minimal human burden. Our experiments show a clear scaling trend in using synthetically generated robot data for large-scale imitation learning and show great promise in harnessing simulation data in real-world tasks. Videos and open-source code are available at https://robocasa.ai/
MeteoRA: Multiple-tasks Embedded LoRA for Large Language Models
Xu, Jingwei, Lai, Junyu, Huang, Yunpeng
The pretrain+fine-tune paradigm is foundational in deploying large language models (LLMs) across a diverse range of downstream applications. Among these, Low-Rank Adaptation (LoRA) stands out for its parameter-efficient fine-tuning (PEFT), producing numerous off-the-shelf task-specific LoRA adapters. However, this approach requires explicit task intention selection, posing challenges for automatic task sensing and switching during inference with multiple existing LoRA adapters embedded in a single LLM. In this work, we introduce MeteoRA (Multiple-Tasks embedded LoRA), a scalable multi-knowledge LoRA fusion framework designed for LLMs. MeteoRA integrates various LoRA adapters in a Mixture-of-Experts (MoE) style into the base LLM, enabling the model to automatically select the most pertinent adapter based on the task input. This advancement significantly enhances the LLM's capability to handle composite tasks that require different adapters to solve various components of the problem. Our evaluations, featuring the LlaMA2-13B and LlaMA3-8B base models equipped with off-the-shelf 28 LoRA adapters through MeteoRA, demonstrate equivalent performance with the individual adapters. Furthermore, both base models equipped with MeteoRA achieve superior performance in sequentially solving composite tasks with ten problems in only a single inference process, highlighting the ability of timely intention switching in MeteoRA embedded LLMs.
- North America > United States > Iowa (0.04)
- Atlantic Ocean > North Atlantic Ocean > English Channel (0.04)
- Asia > Middle East > Iraq > Baghdad Governorate > Baghdad (0.04)
- (14 more...)
Bayesian Hierarchical Reinforcement Learning
We define priors on the primitive environment model and on task pseudo-rewards. Since models for composite tasks can be complex, we use a mixed model-based/model-free learning approach to find an optimal hierarchical policy. We show empirically that (i) our approach results in improved convergence over non-Bayesian baselines, (ii) using both task hierarchies and Bayesian priors is better than either alone, (iii) taking advantage of the task hierarchy reduces the computational cost of Bayesian reinforcement learning and (iv) in this framework, task pseudo-rewards can be learned instead of being manually specified, leading to hierarchically optimal rather than recursively optimal policies.
- North America > United States > Ohio > Cuyahoga County > Cleveland (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- Information Technology > Artificial Intelligence > Machine Learning > Reinforcement Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (1.00)
- Information Technology > Artificial Intelligence > Representation & Reasoning > Uncertainty > Bayesian Inference (0.97)
A Resource Model For Neural Scaling Law
Song, Jinyeop, Liu, Ziming, Tegmark, Max, Gore, Jeff
Neural scaling laws characterize how model performance improves as the model size scales up. Inspired by empirical observations, we introduce a resource model of neural scaling. A task is usually composite hence can be decomposed into many subtasks, which compete for resources (measured by the number of neurons allocated to subtasks). On toy problems, we empirically find that: (1) The loss of a subtask is inversely proportional to its allocated neurons. (2) When multiple subtasks are present in a composite task, the resources acquired by each subtask uniformly grow as models get larger, keeping the ratios of acquired resources constants. We hypothesize these findings to be generally true and build a model to predict neural scaling laws for general composite tasks, which successfully replicates the neural scaling law of Chinchilla models reported in arXiv:2203.15556. We believe that the notion of resource used in this paper will be a useful tool for characterizing and diagnosing neural networks.
- North America > United States > Massachusetts > Middlesex County > Cambridge (0.14)
- Asia > Middle East > Jordan (0.04)